Abstract — Efficient and intelligent music information retrieval
(MIR) is a need of the 21st century. MIR addresses the problem
of querying and retrieving certain types of music from large
music data sets. A singing voice is one of the key elements of
music. As much of a song is characterized by the performing
singer, analysis of the singing voice reveals many characteristics
of the song. The unique qualities of a singer's voice make it
relatively easy to carry out numerous MIR tasks. The
singing voice is characterized by its acoustic
features. Acoustic features such as timbre, vibrato, pitch and
harmony describe the singing voice in music, and these
are discussed in the paper. Many applications of
MIR consider the overall features of music, but this paper
reviews those applications that are concerned
directly with the singing voice. The paper also lists the
feature extraction methods and identifies the features
appropriate for each individual MIR task.

Index Terms — Acoustic Feature Extraction, Classifier, Music 
Information Retrieval, Singing voice, vocal / non-vocal 

I. Introduction 

As a major entertainment product, a huge
amount of digital musical content is produced, broadcast,
distributed and exchanged. The growing volume
of music exchanged over the internet, and the simultaneous
interest of the music industry in finding proper means to deal
with this new mode of distribution, have motivated research
activity in the field of MIR [1]. There is a rising demand for
music search services. Technologies are needed for
efficient categorization and retrieval of these music
collections, so that consumers can be provided with powerful
functions for browsing and searching musical content [2].

A singing voice is one of the key elements of music.
It is one of the less studied vocal expressions,
and analyzing it is an interesting challenge.
Singing voice differs from everyday speech in its intensity
and dynamic range, voiced duration (95% in singing voice
versus 60% in speech), formant structure and pitch.
Moreover, the singing voice is usually accompanied by a loud,
non-stationary background music signal, which makes its analysis
more complex than that of speech [3], [4]. Thus speech processing
techniques that were designed for general speech are not
always suitable for the singing voice. Nevertheless, most of the
earlier work has tried to extend speech processing to the problem
of analyzing music signals.

MIR performs classification tasks in which it assigns
labels to each song based on genre, mood, artist, etc. The
tasks directly related to song classification through analysis of
the singing voice include singer identification, singer verification
and music annotation. Other extended applications involve
distinguishing between trained and untrained singers, analysing
vocal quality, vocal enhancement, etc. The paper provides an
overview of features and techniques used for the above
classification tasks. It provides a summary of different
applications based on singing voice and maps each application
to its most suitable acoustic feature, the extraction method of
that feature and the appropriate classifier. The performance
parameters that are essential to evaluate the system are also
presented.

II. Basic Framework For Singing Voice Analysis 

The singing voice, in addition to being the oldest musical
instrument, is also complex from an acoustic standpoint.
Processing of the singing voice in a music signal basically
involves three major components: separation
of the vocal and non-vocal segments of the song, feature extraction
from the singing part (i.e. the vocal segments), which involves
analysis of acoustic features, and a trained classifier that
assigns the song to one of the classes defined by the problem.

Separation of vocal and non-vocal segments is an 
essential component, as most singing voices in popular 
music are accompanied by musical instruments during vocal 
passages. The feature vectors extracted from such vocal
passages are therefore influenced by the sounds of the accompanying
instruments. Interference from the instrumental background
makes an acoustic classifier a poor match to the acoustics of
the sung vocal line [5], [6], [7]. Hence, most researchers
prefer using a cappella singing (i.e. with no instrumental
background) for feature extraction [8]. For accurate
analysis of the singing voice, it is essential to separate
the singing voice from the accompanying background sound.
Fig. 1 shows the basic steps involved in extracting features 
from singing voice. 

The commonly used techniques for vocal separation are:
extraction of feature parameters based on the distribution of
energy in different frequency bands; hidden Markov
models (HMM) trained as vocal and non-vocal acoustic models; and
application of a melody transcription system to each frame to
estimate whether a significant melody line is present.

Figure 1. Basic blocks involved in processing of singing voice

After the segmentation of the song into vocal/non-vocal
regions, features are extracted from the vocal sections.
Features are extracted using mathematical transformations
such as the Wavelet Transform, Mel Frequency Cepstral Coefficients
(MFCC), Linear Prediction Coefficients (LPC) and
Warped Linear Prediction Coefficients.

The feature vectors are then transferred to the
classification stage. A classifier is trained on a labelled
dataset. When presented with an unknown vocal segment,
the classifier assigns the song to one of the classes of the
problem. The commonly proposed classifiers are the Hidden Markov
Model (HMM), Neural Networks (NN), Support Vector Machines
(SVM) and the Gaussian Mixture Model (GMM) classifier.

III. Acoustic Features Of Singing Voice 

There are many features that can be extracted from a music
signal. These features can be categorized into reference
features, content-based features and text-based features. A 
singing voice can be represented using content-based 
acoustic features which include timbral texture features, 
rhythmic content features and pitch content features. Based 
on these features, singing voice can be analyzed and 
classified. 

A. Pitch Features 

Pitch is the perceived fundamental frequency of a sound,
that is, the actual value of the note sung. It refers
to the relative lowness or highness that we hear in a sound.
Pitch content is described by features such as the Pitch
Histogram (PH), Pitch Class Profile (PCP) and Harmonic Pitch
Class Profile (HPCP), which describe the distribution of pitches.
Further features can be derived from these pitch content
features, such as the location, amplitude and period of the
highest peak in the unfolded histogram, and the distance
between the two highest peaks. The pitch histogram was used in
music genre and mood classification [9], [10] in the early years of
MIR research. The Pitch Class Profile [11] and Harmonic Pitch
Class Profile [12] are used in melody analysis and transcription
[13], [14], [15].
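
As a rough illustration of how such pitch content features might be computed, the sketch below folds a sequence of F0 estimates into a 12-bin pitch class histogram and summarizes its two highest peaks. It assumes F0 values in Hz with 0 marking unvoiced frames, a reference of A4 = 440 Hz and equal-tempered semitone bins. The function names are illustrative and are not taken from the cited works, and the histogram here is the folded (octave-independent) variant rather than the unfolded one.

```python
import numpy as np

def pitch_class_histogram(f0_hz, ref_hz=440.0):
    """Fold voiced F0 estimates (Hz) into a normalized 12-bin histogram."""
    f0 = np.asarray(f0_hz, dtype=float)
    f0 = f0[f0 > 0]                                   # assumption: 0 marks unvoiced frames
    if f0.size == 0:
        return np.zeros(12)
    midi = 69.0 + 12.0 * np.log2(f0 / ref_hz)         # MIDI note number (A4 = 69)
    classes = np.mod(np.round(midi).astype(int), 12)  # 0 = C, ..., 9 = A
    hist = np.bincount(classes, minlength=12).astype(float)
    return hist / hist.sum()

def two_peak_summary(hist):
    """Return (highest bin, its amplitude, second-highest bin, interval in semitones)."""
    order = np.argsort(hist)[::-1]
    first, second = int(order[0]), int(order[1])
    # Distance between the two peaks, measured around the pitch-class circle.
    interval = min((first - second) % 12, (second - first) % 12)
    return first, float(hist[first]), second, interval

# Toy F0 track: mostly A4, some E5, one C5 and two unvoiced frames.
f0_track = [440.0] * 6 + [659.26] * 3 + [0.0] * 2 + [523.25]
hist = pitch_class_histogram(f0_track)
summary = two_peak_summary(hist)
```

In this toy track the two strongest pitch classes are A and E, five semitones apart around the pitch-class circle.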

B. Harmony Features 

One of the most discriminative elements to distinguish
singing voice from speech is harmonicity. In a harmonic sound,
the spectral components are multiples of the lowest
(fundamental) frequency. Due to the rapid vibration of the
vocal folds, the singing voice is nearly always harmonic [16],
and exhibits relatively large amounts of energy at integer
multiples of the fundamental frequency in the low or middle
frequency regions of the spectrogram. Compared to
singing voices, instrumental-only sounds have less salient
harmonics and spread their energy more widely. The harmonic
spectrum is useful in differentiating between low-pitched and
high-pitched singers.
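
One simple way to quantify the harmonicity described above is the share of spectral energy that falls near integer multiples of the fundamental. The following is a minimal sketch under stated assumptions: F0 is already known for the frame, and the number of harmonics and the tolerance around each harmonic (`n_harmonics`, `tol_hz`) are illustrative choices, not values from the cited literature.

```python
import numpy as np

def harmonic_energy_ratio(frame, sample_rate, f0_hz, n_harmonics=10, tol_hz=20.0):
    """Fraction of frame energy lying within tol_hz of the first n_harmonics of f0."""
    frame = np.asarray(frame, dtype=float)
    power = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    mask = np.zeros(len(freqs), dtype=bool)
    for k in range(1, n_harmonics + 1):
        mask |= np.abs(freqs - k * f0_hz) <= tol_hz   # bins close to the k-th harmonic
    total = power.sum()
    return float(power[mask].sum() / total) if total > 0 else 0.0
```

A strongly harmonic frame yields a ratio close to 1, whereas broadband noise yields a ratio close to the fraction of the spectrum covered by the harmonic bands.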

C. Formant Features 

Formants are the meaningful frequency components of
speech and singing. Formant frequencies are determined by
the shape of the vocal tract and thus govern the spectral content
of the singer's voice.
The singer's formant indicates prominent sound energy near
3 kHz and is the result of a clustering of the third, fourth and
fifth formants. This resonance adds a perceptual loudness that
allows a singer's
voice to be heard over a background accompaniment. Some
approaches use the dynamics of the F0 (most predominant
frequency) trajectory, because a singing voice tends to have
temporal variations in its F0 as a consequence of vibrato, and
such temporal information is expected to express the singer's
characteristics. Trained singers often modify the formant
structure of their voice in order to add certain desirable
characteristics. For example, a lowered second formant results
in a "darker" voice (often referred to as "covered"), while a
raised second formant produces a "brighter" voice [17].
Trained singers (especially males) often create a resonance
in the range of 3000 to 5000 Hz by employing a technique in
which the larynx is lowered.

D. Vibrato Features 

The pitch of normal speech ranges from 80 to 400 Hz,
while that of singing can range from 80 to 1000 Hz. In singing,
pitch may be further modulated at a rate of about 4-8
Hz, which results in a phenomenon called vibrato [18]. Vibrato
is defined as a periodic fluctuation of the pitch of the singing
voice. Singers use vibrato
to enhance the expressiveness of their performance. Singing
vibrato, however, is an acquired vocal technique and usually
requires years to master. Vocal vibrato can thus be seen as a
function of the singing style associated with a particular
singer [19]. Some singers have an overly fast vibrato, called
tremolo, while others have a wide and slow vibrato, called
wobble. Vibrato can therefore be considered an important cue
for distinguishing between a well-trained singer and a mediocre
singer [20]. The vibrato features consist of vibrato rate and
vibrato extent. Vibrato rate is the speed of the pitch fluctuation
and vibrato extent is its depth. The
rate of vibrato is typically 5-8 Hz, and the modulation depth
varies between ±50 and ±150 cents (where 1200 cents = 1
octave) [21].
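
The two vibrato features can be estimated from an F0 contour, as the following minimal sketch suggests. It assumes a fully voiced F0 track (all values positive) sampled at a known frame rate, takes the vibrato rate as the strongest spectral peak of the pitch deviation between 3 and 10 Hz, and takes the extent as half the peak-to-peak deviation in cents. The search band and the function name are illustrative assumptions.

```python
import numpy as np

def vibrato_rate_and_extent(f0_hz, frame_rate_hz, band=(3.0, 10.0)):
    """Estimate vibrato rate (Hz) and extent (cents) from a voiced F0 contour."""
    f0 = np.asarray(f0_hz, dtype=float)
    cents = 1200.0 * np.log2(f0)          # 1200 cents = 1 octave
    cents = cents - cents.mean()          # deviation from the average pitch
    spectrum = np.abs(np.fft.rfft(cents * np.hanning(len(cents))))
    freqs = np.fft.rfftfreq(len(cents), d=1.0 / frame_rate_hz)
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    rate = float(freqs[in_band][np.argmax(spectrum[in_band])])
    extent = float((cents.max() - cents.min()) / 2.0)
    return rate, extent
```

For a note held at 220 Hz with a 6 Hz vibrato of ±100 cents, the sketch returns a rate close to 6 Hz and an extent close to 100 cents.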

E. Timbre Features

As a basic element of music, timbre is a term describing
the quality of a sound [22]. Timbre features are used to
differentiate mixtures of sounds that have the same or similar
rhythmic and pitch contents. Cleveland [23] states that an
individual singer has a characteristic timbre that is a function
of the laryngeal source and vocal tract resonances. Timbre is
assumed to be invariant for an individual singer, and there
is a particular range of timbre quality associated with an
individual singer.

Many perceptual characteristics of timbre are evident in
the spectral content or formant structure of a singer's voice.
Extraction of timbre features is thus closely related to spectral
analysis of the music signal and requires pre-processing of
the signal, which follows some standard steps. Instead of
analyzing the signal at song level directly, a
song is usually split into statistically stationary frames
by applying a window function at fixed intervals, which
facilitates subsequent frame-level timbre feature extraction.
The application of a window function reduces the so-called
"edge effects". After framing, spectral analysis techniques
such as the Fast Fourier Transform (FFT), Short Time Fourier
Transform (STFT) and Discrete Wavelet Transform (DWT)
are applied to the windowed signal in each local frame.
From the output magnitude spectra of the STFT, some timbre
features can be defined. Typical timbral features obtained by
capturing simple statistics of the spectra include Spectral
Centroid (SC), Spectral Rolloff (SR), Spectral Flux (SF), Energy,
Zero Crossing and Spectral Bandwidth (SB) [26]. Using the DWT,
a subband analysis can be performed by decomposing the
power spectrum into subbands and applying feature
extraction in each subband to extract more powerful features
such as MFCC, Octave-based Spectral Contrast (OSC) and
Daubechies Wavelet Coefficient Histogram (DWCH). Poli [24]
measured timbre quality from the spectral envelope of MFCC
features to identify singers. In [25], timbre is characterized by
the harmonic lines of the harmonic sound.
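
The framing and windowing steps and a few of the spectral statistics named above can be sketched as follows. This is a minimal illustration, not the procedure of any cited work: the frame length, hop size, Hann window, 85% rolloff threshold and the normalization used for spectral flux are all assumptions.

```python
import numpy as np

def split_frames(x, frame_len, hop):
    """Cut a signal into overlapping frames taken at fixed intervals."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return np.asarray(x, dtype=float)[idx]

def timbre_features(x, sample_rate, frame_len=1024, hop=512, rolloff=0.85):
    """Frame-level spectral centroid, rolloff, flux, zero crossing rate and energy."""
    raw = split_frames(x, frame_len, hop)
    mag = np.abs(np.fft.rfft(raw * np.hanning(frame_len), axis=1))
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sample_rate)
    total = mag.sum(axis=1) + 1e-12                      # avoid division by zero
    centroid = (mag * freqs).sum(axis=1) / total         # magnitude-weighted mean frequency
    cumulative = np.cumsum(mag, axis=1)
    # First bin below which the chosen share of the spectral magnitude lies.
    rolloff_hz = freqs[(cumulative >= rolloff * total[:, None]).argmax(axis=1)]
    norm = mag / total[:, None]
    # Squared change of the normalized spectrum between consecutive frames.
    flux = np.concatenate(([0.0], np.sum(np.diff(norm, axis=0) ** 2, axis=1)))
    zcr = np.mean(np.signbit(raw[:, 1:]) != np.signbit(raw[:, :-1]), axis=1)
    energy = np.mean(raw ** 2, axis=1)
    return {"centroid": centroid, "rolloff": rolloff_hz, "flux": flux,
            "zcr": zcr, "energy": energy}
```

For a steady 1 kHz sine sampled at 16 kHz, the centroid sits near 1 kHz, the flux is essentially zero and the zero crossing rate is about 0.125 crossings per sample.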

IV. Choosing The Acoustic Features 

The choice of audio features depends strongly on the
task to be performed. Timbre features are suitable for genre
and instrument classification but not for
comparing the melodic similarity of two songs. For mood
classification, a large amount of work has used rhythm features
[27], [28], [29], [30]. While pitch and harmonic features are
not very popular in standard classification systems based
on genre, artist, mood, etc., they are the most important feature
types for song similarity retrieval and cover song detection
at the melodic level [31], [32], [33], [34], where timbre features
fail to achieve good results. This is corroborated by a recent
comparative study on music similarity [35], which showed
that timbre features best explain the instrumentation of
the music. Different melodies played by the same instrument
would produce more similar timbre features than those
corresponding to the same melody with different
instrumentation. In general, there is no single set of
task-independent features that consistently outperforms the
others.

V. Feature Extraction Methods

The separation of the vocal part from the music accompaniment
is potentially very challenging and is key to providing
a better solution to the singing voice analysis problem. If vital
information is lost during this stage, the performance of the
following classification stage is inherently crippled and can
never measure up to human capability. Typically in singing
voice analysis, spectral features are extracted for each song
from frames all over the song and are then clustered to group
similar frames together. During feature extraction, the signals
are converted into a sequence of feature vectors, which are
then transferred to the classification stage. If multiple features
are available, they can be combined to
enhance the performance of the system. A good feature
extraction approach should have the following characteristics: it
should be comprehensive (represent the music very well),
compact (require much less storage space than the raw
acoustic data) and efficient (require little computation for
extraction) [36].

A. Spectral Envelope Estimation

In the music field, signals are usually analyzed in the
frequency domain, where the features of the music signal are
usually more apparent. The individual
characteristics of vocal signals are noticeable in their spectral
envelopes [37]. A spectral envelope estimates the vocal tract
response. It is a curve that envelops the magnitude of the
short-time spectrum of a signal, linking the peaks or passing
close to the maxima of non-sinusoidal spectra.

An effective spectral envelope estimation technique must 
be capable of handling a wide range of signals with varying 
characteristics. It is important for a spectral envelope to 
provide a proper fit to the magnitude spectrum. A certain 
level of smoothness is desired for a spectral envelope. It 
should not oscillate erratically, but instead should give a 
general idea of the distribution of the signal's energy. As the 
spectral envelope is defined relative to a short segment of
the signal (typically between 10 and 50 ms), it should also
be consistent from frame to frame.

Estimation of spectral envelope is the task of deriving 
spectral envelopes from a given signal. Common spectral envelope
estimation methods are Linear Predictive Coding, the Cepstrum
and the Discrete Cepstrum.

Linear Predictive Coding (LPC): LPC is an efficient
autoregressive method that builds up the spectral
envelope as the transfer function of an all-pole filter of
order p. LPC can efficiently indicate the characteristics
of the harmonic components of the audio signal. Since over 90%
of the singing signal is harmonic, LPC is also a good choice
for representing the features of a singing voice. The spectral
envelope extracted using LPC precisely represents the formants
of the singing voice.
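
To make the all-pole model concrete, the sketch below estimates LPC coefficients with the autocorrelation method and the Levinson-Durbin recursion, then evaluates the envelope as gain divided by |A(e^{jω})|. It is one common formulation, offered under stated assumptions: a Hamming window, a non-silent frame (so the zero-lag autocorrelation is positive), and the sign convention A(z) = 1 + a_1 z^{-1} + ... + a_p z^{-p}. The function names are illustrative.

```python
import numpy as np

def lpc_coefficients(frame, order):
    """Autocorrelation-method LPC via Levinson-Durbin; returns (a, error power)."""
    x = np.asarray(frame, dtype=float) * np.hamming(len(frame))
    # Autocorrelation at lags 0..order.
    r = np.correlate(x, x, mode="full")[len(x) - 1 : len(x) + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1 : 0 : -1])
        k = -acc / err                       # reflection coefficient
        prev = a.copy()
        a[1:i] = prev[1:i] + k * prev[i - 1 : 0 : -1]
        a[i] = k
        err *= 1.0 - k * k                   # remaining prediction error power
    return a, err

def lpc_envelope(a, err, n_fft=512):
    """Magnitude response of the all-pole filter sqrt(err) / A(z) on n_fft/2 + 1 bins."""
    return np.sqrt(err) / np.abs(np.fft.rfft(a, n_fft))
```

Peaks of the returned envelope indicate resonances of the modelled filter, which is how formant locations are usually read off an LPC spectrum.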

Cepstrum Spectral Envelope: The cepstrum is a method
of speech analysis based on a spectral representation of the
signal. Once the spectral envelope of the signal has been
obtained, it can be analysed to find its peaks, which
provide important information about the most relevant formants.
Cepstral coefficients derived from LPC analysis have
proved to be more robust to noise than FFT-derived ones.
LPC-derived cepstral coefficients are thus more appropriate for
the singing signal, which is mixed with instrumental
sounds [37].
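
For comparison, an FFT-derived cepstral envelope can be sketched by keeping only the low-quefrency cepstral coefficients and transforming back to the log-spectral domain. This is a minimal illustration of the real cepstrum with a rectangular lifter; the Hann window and the number of retained coefficients (`n_keep`) are assumptions.

```python
import numpy as np

def cepstral_envelope(frame, n_keep=30):
    """Smoothed log-magnitude spectrum from the first n_keep cepstral coefficients."""
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    log_mag = np.log(np.abs(np.fft.fft(frame * np.hanning(n))) + 1e-12)
    cepstrum = np.fft.ifft(log_mag).real      # real cepstrum (symmetric in quefrency)
    lifter = np.zeros(n)
    lifter[:n_keep] = 1.0                     # keep low quefrencies ...
    lifter[n - n_keep + 1:] = 1.0             # ... and their mirrored counterparts
    smooth_log = np.fft.fft(cepstrum * lifter).real
    return smooth_log[: n // 2 + 1]           # envelope on the non-negative frequencies
```

A smaller `n_keep` gives a smoother envelope; retaining all coefficients reproduces the original log-magnitude spectrum.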

B. Mel-Frequency Cepstrum Coefficient (MFCC)

The MFCCs are efficient audio descriptors designed to 
capture short-term spectral-based features providing spectral 
energy measurements over short time windows. In order to 
calculate MFCCs, the signal is first broken into overlapping
frames, each approximately 25 ms long, a time scale at which
the signal is assumed to be stationary. The log-magnitude of
the discrete Fourier transform of each window is warped to 
the Mel frequency scale. A discrete cosine transform (DCT) 
is then applied and the lower coefficients of the DCT are 
used to represent a rough shape of the spectrum. By choosing 
a proper order of the MFCC feature vector, the characteristics 
of a human voice can be effectively revealed. 
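
The steps just described can be sketched in a few lines. The version below follows the common ordering in which the power spectrum is first passed through a triangular mel filterbank and the logarithm is taken afterwards. The mel formula (2595 log10(1 + f/700)), the 10 ms hop, 26 filters and 13 coefficients are conventional choices assumed here for illustration, not values stated in the text.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters spaced evenly on the mel scale, shape (n_filters, n_fft//2 + 1)."""
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                                  n_filters + 2))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    bank = np.zeros((n_filters, len(freqs)))
    for i in range(n_filters):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        bank[i] = np.maximum(0.0, np.minimum(rising, falling))
    return bank

def dct_ii(x, n_out):
    """Unnormalized DCT-II along the last axis, keeping the first n_out coefficients."""
    n = x.shape[-1]
    k = np.arange(n_out)[:, None]
    basis = np.cos(np.pi * k * (2 * np.arange(n) + 1) / (2.0 * n))
    return x @ basis.T

def mfcc(x, sample_rate, frame_ms=25.0, hop_ms=10.0, n_filters=26, n_coeffs=13):
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop = int(round(sample_rate * hop_ms / 1000.0))
    n_fft = 1 << (frame_len - 1).bit_length()            # next power of two
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = np.asarray(x, dtype=float)[idx] * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    mel_energy = power @ mel_filterbank(n_filters, n_fft, sample_rate).T
    return dct_ii(np.log(mel_energy + 1e-10), n_coeffs)  # lower DCT coefficients only
```

Each row of the result is one frame's feature vector; the order of the vector is set by `n_coeffs`.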

MFCCs have been the most successful acoustic
features in speech and speaker recognition systems. They
have also been used successfully on music signals for artist
identification, instrument identification and genre
classification.

C. Wavelet Transform 

Like the Fourier transform, a wavelet transform is viewed as
a tool for dividing signals into different frequency
components and then analysing each component with a
resolution matched to its scale [38]. Wavelets are designed
to give good time resolution at high frequencies and good
frequency resolution at low frequencies. After the wavelet
decomposition, a histogram of each sub-band is constructed.
A wavelet coefficient histogram is the histogram of the
(rounded) wavelet coefficients obtained by convolving a
wavelet filter with an input music signal. Using the wavelet
histogram, one can calculate statistical features for each
sub-band, such as the sub-band energy, defined as the mean of the
absolute value of the coefficients, and the first three moments,
i.e. the average, the variance and the skewness. The
histograms of wavelet coefficients give a good estimate
of the probability distribution over time and thus lead to a
good feature representation. The wavelet transform is
widely used in many MIR applications such as classification,
similarity, pitch detection, beat tracking and indexing
problems.
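
The sub-band statistics described above can be illustrated with a small sketch. For brevity it uses the Haar wavelet (the simplest Daubechies wavelet) as a stand-in for whichever wavelet filter a given system employs, and it computes the moments directly from the coefficients rather than from the rounded histogram. All function names are illustrative.

```python
import numpy as np

def haar_dwt(x, levels):
    """Multi-level orthonormal Haar decomposition: returns (approximation, [details])."""
    approx = np.asarray(x, dtype=float)
    details = []
    for _ in range(levels):
        if len(approx) % 2:
            approx = approx[:-1]                       # drop a trailing odd sample
        even, odd = approx[0::2], approx[1::2]
        details.append((even - odd) / np.sqrt(2.0))    # high-pass sub-band
        approx = (even + odd) / np.sqrt(2.0)           # low-pass sub-band
    return approx, details

def coefficient_histogram(coeffs):
    """Histogram of rounded coefficients as (values, relative frequencies)."""
    values, counts = np.unique(np.round(coeffs), return_counts=True)
    return values, counts / counts.sum()

def subband_statistics(coeffs):
    """Sub-band energy (mean absolute value) and the first three moments."""
    c = np.asarray(coeffs, dtype=float)
    mean, var = c.mean(), c.var()
    skew = np.mean((c - mean) ** 3) / var ** 1.5 if var > 0 else 0.0
    return {"energy": float(np.mean(np.abs(c))), "mean": float(mean),
            "variance": float(var), "skewness": float(skew)}
```

A feature vector for a frame can then be formed by concatenating the statistics of every sub-band.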

VI. Classifiers 

The purpose of classifier learning is to find a mapping
from the feature space to the output labels that
minimizes the prediction error. The common
choices of classifier are K-nearest neighbor, the support vector
machine and the GMM classifier. Various other classifiers have
also been used for different music-related tasks, including
logistic regression, Artificial Neural Networks (ANN),
decision trees, Linear Discriminant Analysis (LDA), Nearest
Centroid (NC) and the Sparse Representation-based Classifier
(SRC).

A. K-Nearest Neighbor (K-NN) Classifier

The K-Nearest Neighbors algorithm is one of the simplest
machine learning algorithms. It classifies an object by
the majority vote of its neighbors, which are selected by
distance (usually the Euclidean distance). Given an input feature
vector, the algorithm finds the k closest training feature vectors,
which may represent
different classes. The disadvantage of the K-NN classifier is that
its accuracy relies on the selection of an optimum number of
neighbors and the most suitable distance measure.
K-NN has been applied to various music sound analysis
problems.
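
The voting rule is short enough to sketch in full. The example assumes Euclidean distance and plain majority voting; the feature vectors and singer labels are made up for illustration.

```python
from collections import Counter

import numpy as np

def knn_predict(train_x, train_y, query, k=3):
    """Label a query vector by majority vote among its k nearest training vectors."""
    train_x = np.asarray(train_x, dtype=float)
    dists = np.linalg.norm(train_x - np.asarray(query, dtype=float), axis=1)
    nearest = np.argsort(dists)[:k]                    # indices of the k closest vectors
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical two-dimensional feature vectors for two singers.
train_x = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
train_y = ["A", "A", "A", "B", "B", "B"]
label = knn_predict(train_x, train_y, [0.5, 0.5], k=3)
```

Both `k` and the distance measure are free parameters here, which is exactly the sensitivity noted above.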

B. Support Vector Machine (SVM) 

The SVM is a state-of-the-art binary classifier based on the
large margin principle, and it works well with high-dimensional
data [39]. Intuitively, it aims to construct hyperplanes that
divide a data set into n regions, where n is the number of
class labels in the data set. These hyperplanes reduce to a
set of Lagrange multipliers, one for each training case, and the
training points that
have non-zero multipliers are the support vectors. The
machine saves these support vectors and applies them to
new data in the form of the test set for further on-line
classification. The SVM therefore achieves good classification
performance, since it focuses on the difficult instances.
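
The large margin idea can be illustrated with a linear soft-margin SVM trained by subgradient descent on the regularized hinge loss. Note that this is a primal formulation chosen for brevity; it does not compute the Lagrange multipliers of the dual problem described above. The learning rate, regularization strength and number of epochs are illustrative assumptions, and labels are assumed to be -1 or +1.

```python
import numpy as np

def train_linear_svm(x, y, lam=0.01, epochs=200, lr=0.1):
    """Minimize lam/2 * |w|^2 + mean(hinge loss) by full-batch subgradient descent."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)          # assumption: labels are -1 or +1
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(epochs):
        margins = y * (x @ w + b)
        active = margins < 1.0              # only margin violators drive the update
        grad_w = lam * w - (y[active, None] * x[active]).sum(axis=0) / len(x)
        grad_b = -y[active].sum() / len(x)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def svm_predict(x, w, b):
    return np.sign(np.asarray(x, dtype=float) @ w + b)
```

Only the points that violate the margin contribute to each update, which mirrors the remark that the SVM concentrates on the difficult instances.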

C. Gaussian Mixture Model (GMM) 

The Gaussian mixture model uses multiple weighted 
Gaussians to attempt to capture the behavior of each class of 
training data. The use of multiple Gaussians is particularly 
beneficial when analyzing data that has a distribution not 
well modeled by a single cluster. It is known that GMMs
provide good approximations of arbitrarily shaped densities
of a spectrum over a long span of time [40], and hence can
reflect the vocal tract configurations of an individual singing
voice. The GMM is thus a very flexible model that can adapt to
encompass almost any distribution of data. Test points are
classified by a maximum likelihood discriminant function,
calculated from their distances to the multiple Gaussians of
the class distributions [41]. To determine the parameters of
the Gaussians that best model each class, the well-known
Expectation Maximization (EM) technique is used. EM is
an iterative algorithm that converges on parameters that are
locally optimal according to the log-likelihood function. It is
also useful to perform Principal Component Analysis (PCA)
prior to EM. PCA is a multi-dimensional rotation of the data
onto the axes of maximal variance. It also has the added benefit
of normalizing the data variances, which avoids highly
different scaling among the dimensions, a problem
for EM.
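
A minimal sketch of this scheme is given below: one GMM is fitted per class with EM, and a set of test frames is assigned to the class whose model gives the highest total log-likelihood. Diagonal covariances, random initialization of the means from the data, a fixed number of iterations and the variance floor are simplifying assumptions. PCA is omitted.

```python
import numpy as np

def _log_components(x, weights, means, variances):
    """Log of (weight * Gaussian density) for every frame and component."""
    diff = x[:, None, :] - means[None, :, :]
    log_gauss = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                        + np.sum(np.log(2.0 * np.pi * variances), axis=1))
    return np.log(weights) + log_gauss

def _logsumexp(a):
    m = a.max(axis=1)
    return m + np.log(np.exp(a - m[:, None]).sum(axis=1))

def fit_diag_gmm(x, n_components, n_iter=50, seed=0):
    """Fit a diagonal-covariance GMM with EM; returns (weights, means, variances)."""
    x = np.asarray(x, dtype=float)
    n = x.shape[0]
    rng = np.random.default_rng(seed)
    means = x[rng.choice(n, n_components, replace=False)].copy()
    variances = np.tile(x.var(axis=0) + 1e-6, (n_components, 1))
    weights = np.full(n_components, 1.0 / n_components)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each frame.
        log_p = _log_components(x, weights, means, variances)
        resp = np.exp(log_p - _logsumexp(log_p)[:, None])
        # M-step: re-estimate weights, means and variances.
        nk = resp.sum(axis=0) + 1e-12
        weights = nk / n
        means = (resp.T @ x) / nk[:, None]
        variances = np.maximum((resp.T @ (x ** 2)) / nk[:, None] - means ** 2, 1e-6)
    return weights, means, variances

def gmm_log_likelihood(x, model):
    """Total log-likelihood of the frames in x under one fitted model."""
    return float(_logsumexp(_log_components(np.asarray(x, dtype=float), *model)).sum())

def classify(x, models):
    """Pick the class label whose GMM assigns the frames the highest likelihood."""
    scores = {label: gmm_log_likelihood(x, m) for label, m in models.items()}
    return max(scores, key=scores.get)
```

Because EM converges only to a local optimum, the fitted model depends on the initialization, which is why the seed is exposed as a parameter.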

VII. Performance Parameters 

Performance parameters are essential to evaluate the
system. Some of those commonly defined in MIR applications
are:

A. Accuracy

Every system should compute its accuracy, defined as

$$\mathrm{Accuracy} = \frac{\text{Number of correctly identified test samples}}{\text{Total number of samples}} \tag{1}$$

B. False Alarm Rate (FAR) 

FAR is defined as the number of false alarms divided by
the total number of target frames [3], [4].

$$\mathrm{FAR} = \frac{\text{Frames falsely detected as target}}{\text{Total frames labelled as target}} \tag{2}$$

C. Miss Detection Rate (MDR) 

The miss detection rate is the number of target frames
that go undetected divided by the total number of target
frames [3], [4].

$$\mathrm{MDR} = \frac{\text{Frames labelled as target but undetected}}{\text{Total target frames}} \tag{3}$$
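
The three measures can be computed from frame-level labels as in the sketch below. It follows the definitions given in this section, so FAR is normalized by the number of target frames, whereas some other definitions normalize by the number of non-target frames. The example label sequences are made up.

```python
def accuracy(predicted, actual):
    """Correctly identified samples divided by the total number of samples."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

def false_alarm_rate(detected, target):
    """Frames falsely detected as target divided by frames labelled as target."""
    false_alarms = sum(1 for d, t in zip(detected, target) if d and not t)
    return false_alarms / sum(1 for t in target if t)

def miss_detection_rate(detected, target):
    """Target frames left undetected divided by the total number of target frames."""
    misses = sum(1 for d, t in zip(detected, target) if t and not d)
    return misses / sum(1 for t in target if t)

# Hypothetical frame labels: 1 = vocal (target), 0 = non-vocal.
target = [1, 1, 1, 1, 0, 0, 0, 0]
detected = [1, 1, 0, 0, 1, 0, 0, 0]
results = (accuracy(detected, target),
           false_alarm_rate(detected, target),
           miss_detection_rate(detected, target))
```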

VIII. Applications 

The ability to capture parameters associated with the vocal
qualities of the singing voice can be applied to a number of tasks
in MIR. Some applications that offer potential for
research are listed below:

• Perform the singer identification task to determine who among
a group of candidate singers sang a given part of a song.

• Evaluate or assess the performer's singing
ability (performance) in terms of technical accuracy and
assign it a rating score.

• Provide a detailed characterization of a particular singing
voice to classify it according to skill, style, gender,
register and vocal texture.

• Identify trained and untrained singers by analyzing their
acoustic features.

• Perform classical enhancements on the singing voices
of untrained singers.

• Perform the singer verification task to decide whether or
not a claimed singer performed a given song.

• Convert the music retrieval problem to text retrieval by
labeling songs with appropriate tags, substituting songs
with text annotations.

The discussions in Sections III to VI and
VIII are summarized in Table I, which lists the different
applications based on singing voice and maps each application
to its most suitable acoustic feature, the extraction method of
that feature and the appropriate classifier.

Table I. List of Applications Based on Singing Voice Analysis

MIR Application                   | Acoustic Features                           | Extraction Method          | Classifier    | Used in
----------------------------------+---------------------------------------------+----------------------------+---------------+-------------------------
Singer identification             | Formant features, harmonic features, timbre | MFCC features, LPC         | GMM, SVM, HMM | [1]-[4], [8], [19], [37]
Singer verification               | Timbral features                            | Cepstrum coefficient       | GMM           | [42]
Identify trained/untrained singer | Vibrato, formant features                   | LPC, cepstrum coefficient  | GMM           | [20], [43], [44]
Music annotation                  | Delta-MFCC x MuVar                          | MFCC                       | GMM, SVM      | [36]
Signal enhancement                | MFCC                                        | MFCC                       | GMM           | [20], [45]

Conclusions

The vast amount of music accessible to the general public
calls for developing tools to effectively and efficiently retrieve
and manage the music of interest to the end users. As the
singing voice is the basic element of a song that attracts the
most attention of listeners, organizing, browsing and
classifying music signals based on singing voice is useful
for MIR systems.

The complexity of MIR classification increases with
the number of features used within the classifier. It is therefore
crucial to understand the acoustic features and accordingly
select only the most relevant features in order to increase the
performance of the MIR system. The paper has collectively
described the acoustic features of singing voice, their
extraction methods and the classifiers. The paper has also
put forward some application areas that explore singing voice
analysis.

© 2013 ACEEE
DOI: 03.LSCS.2013.3.520